Abstract

This paper proposes methods to detect outliers in functional data sets. The task of identifying atypical curves is carried out using the recently proposed kernelized functional spatial depth (KFSD). KFSD is a local depth that can be used to order the curves of a sample from the most to the least central. Since outliers are usually among the least central curves, we present a probabilistic result that allows us to select a threshold value for KFSD such that curves with depth values lower than the threshold are detected as outliers. Based on this result, we propose three new outlier detection procedures. The results of a simulation study show that our proposals generally outperform a battery of competitors. We apply our procedures to a real data set consisting of daily curves of emission levels of nitrogen oxides (NOx), since it is of interest to identify abnormal NOx levels in order to take the necessary environmental policy actions.

Keywords: Functional depths; Functional outlier detection; Kernelized functional spatial 
depth; Nitrogen oxides; Smoothed resampling. 




1 INTRODUCTION 


The accurate identification of outliers is an important aspect of any statistical data analysis. Nowadays there are well-established outlier detection techniques in the univariate and multivariate frameworks (for a complete review of the topic, see for example Barnett and Lewis 1994). In recent years, new types of data have become available and tractable thanks to the evolution of computing resources, e.g., big multivariate data sets having more variables than observations (high-dimensional multivariate data) or samples composed of repeated measurements of the same observation taken over an ordered set of points that can be interpreted as realizations of stochastic processes (functional data). In this paper we focus on functional data, which are usually studied with the tools provided by functional data analysis (FDA). For overviews on FDA methods, see Ramsay and Silverman (2005), Ferraty and Vieu (2006), Horváth and Kokoszka (2012) or Cuevas (2014). For environmental statistical problems tackled using FDA techniques, see for example Ignaccolo et al. (2015), Menafoglio et al. (2014) and Ruiz-Medina and Espejo (2012).

As in univariate or multivariate analysis, the detection of outliers is also fundamental in FDA. According to Febrero et al. (2007, 2008), a functional outlier is a curve generated by a stochastic process with a different distribution than the one of normal curves. This definition covers many types of outliers, e.g., magnitude outliers, shape outliers and partial outliers, i.e., curves having atypical behaviors only in some segments of the domain. Shape and partial outliers are typically harder to detect than magnitude outliers (high magnitude outliers can even be recognized by simply looking at a graph), and therefore entail more challenging outlier detection problems. In this paper we focus on samples contaminated by low magnitude, shape or partial outliers.

Specifically, we consider a real data set consisting of nitrogen oxides (NOx) daily emission levels measured in the Barcelona area (see Febrero et al. 2008 for a first analysis of this data set). Since NOx are among the most important pollutants, cause ozone formation and contribute to global warming, it is of interest to identify days with abnormally large NOx emissions, so that actions can be implemented to control their causes, which are primarily the combustion processes generated by motor vehicles and industries.
We propose to detect functional outliers using the notion of functional depth. A functional depth is a measure providing a $P$-based center-outward ordering criterion for observations of a functional space $\mathbb{H}$, where $P$ is a probability distribution on $\mathbb{H}$. When a sample of curves is available, a functional depth orders the curves from the most to the least central according to their depth values and, if any outlier is in the sample, its depth is expected to be among the lowest values. Therefore, it is reasonable to build outlier detection methods that use functional depths.

In this paper we enlarge the number of available functional outlier detection procedures by presenting three new methods based on a specific depth, the kernelized functional spatial depth (KFSD, Sguera et al. 2014). KFSD is a local-oriented depth, that is, a depth which orders curves looking at narrow neighborhoods and giving more weight to close than distant curves. Its approach is opposite to what global-oriented depths do. Indeed, any global depth makes the depth of a given curve depend on all the remaining observations, with equal weights for all of them. This is the case of a global-oriented depth such as the functional spatial depth (FSD, Chakraborty and Chaudhuri 2014), of which KFSD is the local version. A local depth such as KFSD may be useful to analyze functional samples having a structure deviating from unimodality or symmetry. Moreover, the local approach behind KFSD proved to be a good strategy in supervised classification problems with groups of curves not extremely clear-cut (see Sguera et al. 2014). In this paper, we illustrate that KFSD ranks low magnitude, shape or partial outliers well, that is, their corresponding KFSD values are in general lower than those of normal curves. Then, we propose different procedures to select a threshold for KFSD to distinguish between normal curves and outliers. These procedures employ smoothed resampling techniques and are based on a theoretical result which provides a probabilistic upper bound on a desired false alarm probability of detecting normal curves as outliers. Note that the probabilistic foundations of the proposed methods represent a novelty in FDA outlier detection problems. We study the performances of our procedures in a simulation study and by analyzing the NOx data set. We show this data set in Figure 1, where it is already possible to appreciate that the presence of partial outliers is an issue.


Figure 1: NOx levels measured in µg/m³ every hour of 76 working days between 23/02/2005 and 26/06/2005 in Poblenou, Barcelona.

We compare our methods with some alternative outlier detection procedures: Febrero et al. (2008) proposed to label as outliers those curves with depth values lower than a certain threshold. As functional depths, they considered the Fraiman and Muniz depth (Fraiman and Muniz 2001), the h-modal depth (Cuevas et al. 2006) and the integrated dual depth (Cuevas and Fraiman 2009). To determine the depth threshold, they proposed two different bootstrap procedures based on depth-based trimmed or weighted resampling, respectively; Sun and Genton (2011) introduced the functional boxplot, which is constructed using the ranking of curves provided by the modified band depth (López-Pintado and Romo 2009). The proposed functional boxplot detects outliers using a rule that is similar to the one of the standard boxplot; Hyndman and Shang (2010) proposed to reduce the outlier detection problem from functional to multivariate data by means of functional principal component analysis (FPCA), and to use two alternative multivariate techniques on the scores to detect outliers, i.e., the bagplot and the high density region boxplot, respectively.

The remainder of the article is organized as follows. In Section 2 we recall the definition of KFSD. In Section 3 we consider the functional outlier detection problem. In Theorem 1 we present the result on which three new outlier detection methods, all employing KFSD as depth function, are based. In Section 4 we report the results of our simulation study, whereas in Section 5 we perform outlier detection on the NOx data set. In Section 6 we draw some conclusions. Finally, in the Appendix we report a sketch of the proof of Theorem 1.


2 THE KERNELIZED FUNCTIONAL SPATIAL DEPTH

In functional spaces a depth measure has the purpose of measuring the degree of centrality of curves relative to the distribution of a functional random variable. Various functional depths have been proposed following two alternative approaches: a global approach, which implies that the depth of an observation depends equally on all the observations allowed by $P$ on $\mathbb{H}$, and a local approach, which instead makes the depth of an observation depend more on close than distant observations. Among the existing global-oriented depths there are the Fraiman and Muniz depth (FMD, Fraiman and Muniz 2001), the random Tukey depth (RTD, Cuesta-Albertos and Nieto-Reyes 2008), the integrated dual depth (IDD, Cuevas and Fraiman 2009), the modified band depth (MBD, López-Pintado and Romo 2009) or the functional spatial depth (FSD, Chakraborty and Chaudhuri 2014). Proposals of local-oriented depths are instead the h-modal depth (HMD, Cuevas et al. 2006) or the kernelized functional spatial depth (KFSD, Sguera et al. 2014).


In this paper we focus on KFSD. Before giving its definition, we recall the definition of the functional spatial depth (FSD, Chakraborty and Chaudhuri 2014). Let $\mathbb{H}$ be an infinite-dimensional Hilbert space. Then, for $x \in \mathbb{H}$ and the functional random variable $Y \in \mathbb{H}$, the FSD of $x$ relative to $Y$ is given by

$$\mathrm{FSD}(x, Y) = 1 - \left\| E\left[ \frac{x - Y}{\|x - Y\|} \right] \right\|,$$

where $\|\cdot\|$ is the norm inherited from the usual inner product in $\mathbb{H}$. For a random sample of $Y$ of size $n$, i.e., $Y_n = \{y_1, \ldots, y_n\}$, the sample version of FSD has the following form:

$$\mathrm{FSD}(x, Y_n) = 1 - \frac{1}{n} \left\| \sum_{i=1}^{n} \frac{x - y_i}{\|x - y_i\|} \right\|. \tag{1}$$
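As an illustration of (1), the following minimal sketch computes the sample FSD for curves observed on a common equidistant grid. It assumes the L2 norm approximated by a Riemann sum with grid spacing `step`, and it skips sample curves that coincide with `x` (for which the unit vector is undefined); the function names are hypothetical and not taken from the paper.

```python
import math

def l2_norm(curve, step=1.0):
    """Riemann-sum approximation of the L2 norm of a discretized curve."""
    return math.sqrt(sum(v * v for v in curve) * step)

def fsd(x, sample, step=1.0):
    """Sample functional spatial depth of x relative to `sample`, following (1)."""
    m = len(x)
    total = [0.0] * m  # running sum of the unit vectors (x - y_i) / ||x - y_i||
    for y in sample:
        diff = [a - b for a, b in zip(x, y)]
        norm = l2_norm(diff, step)
        if norm == 0.0:  # assumption: curves equal to x contribute nothing
            continue
        for k in range(m):
            total[k] += diff[k] / norm
    return 1.0 - l2_norm(total, step) / len(sample)
```

In a toy sample of three constant curves at levels 0, 1 and 2, the middle curve attains depth 1 because the two unit vectors cancel, while the lowest curve attains depth 1/3.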


As mentioned before, FSD is a global-oriented depth and KFSD is a local version of it. KFSD is obtained by writing (1) in terms of inner products and then replacing the inner product function with a positive definite and stationary kernel function. This replacement exploits the relationship

$$\kappa(x, y) = \langle \phi(x), \phi(y) \rangle, \quad x, y \in \mathbb{H}, \tag{2}$$

where $\kappa$ is the kernel $\kappa : \mathbb{H} \times \mathbb{H} \to \mathbb{R}$, $\phi$ is the embedding map $\phi : \mathbb{H} \to \mathbb{F}$ and $\mathbb{F}$ is a feature space. Indeed, a definition of KFSD in terms of $\phi$ can be given, that is,


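$$\mathrm{KFSD}(x, Y) = 1 - \left\| E\left[ \frac{\phi(x) - \phi(Y)}{\|\phi(x) - \phi(Y)\|} \right] \right\|, \tag{3}$$

and it can be interpreted as a recoded version of $\mathrm{FSD}(x, Y)$ since $\mathrm{KFSD}(x, Y) = \mathrm{FSD}(\phi(x), \phi(Y))$. The sample version of (3) is given by

$$\mathrm{KFSD}(x, Y_n) = 1 - \frac{1}{n} \left\| \sum_{i=1}^{n} \frac{\phi(x) - \phi(y_i)}{\|\phi(x) - \phi(y_i)\|} \right\|.$$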

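Then, standard calculations (see Appendix) and (2) provide an alternative expression of $\mathrm{KFSD}(x, Y_n)$, in this case in terms of $\kappa$:

$$\mathrm{KFSD}(x, Y_n) = 1 - \frac{1}{n} \left( \sum_{\substack{i,j=1 \\ y_i \neq x,\, y_j \neq x}}^{n} \frac{\kappa(x, x) + \kappa(y_i, y_j) - \kappa(x, y_i) - \kappa(x, y_j)}{\sqrt{\kappa(x, x) + \kappa(y_i, y_i) - 2\kappa(x, y_i)}\, \sqrt{\kappa(x, x) + \kappa(y_j, y_j) - 2\kappa(x, y_j)}} \right)^{1/2}. \tag{4}$$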
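The kernel-only expression (4) translates almost directly into code. The sketch below assumes curves discretized on a common equidistant grid, an L2 norm approximated by a Riemann sum, and a Gaussian kernel of the form exp(-||x - y||^2 / sigma^2); that kernel form and all function names are assumptions made for illustration.

```python
import math

def l2_dist(x, y, step=1.0):
    """Riemann-sum approximation of the L2 distance between two curves."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) * step)

def gaussian_kernel(x, y, sigma, step=1.0):
    # Assumed kernel form: exp(-||x - y||^2 / sigma^2)
    return math.exp(-l2_dist(x, y, step) ** 2 / sigma ** 2)

def kfsd(x, sample, sigma, step=1.0):
    """Sample KFSD of x relative to `sample` via the kernel expression (4)."""
    n = len(sample)
    pts = [y for y in sample if l2_dist(x, y, step) > 0.0]  # terms with y_i != x
    kxx = gaussian_kernel(x, x, sigma, step)
    kxy = [gaussian_kernel(x, y, sigma, step) for y in pts]
    kyy = [gaussian_kernel(y, y, sigma, step) for y in pts]
    # ||phi(x) - phi(y_i)|| written through the kernel only
    dist = [math.sqrt(kxx + kyy[i] - 2.0 * kxy[i]) for i in range(len(pts))]
    total = 0.0
    for i, yi in enumerate(pts):
        for j, yj in enumerate(pts):
            num = kxx + gaussian_kernel(yi, yj, sigma, step) - kxy[i] - kxy[j]
            total += num / (dist[i] * dist[j])
    # max(..., 0.0) guards against tiny negative values due to rounding
    return 1.0 - math.sqrt(max(total, 0.0)) / n
```

The double loop costs O(n^2) kernel evaluations per curve, which is acceptable for samples of the size considered here (n = 50); caching the kernel matrix would be the natural refinement.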


Note that (4) only requires the choice of $\kappa$, and not of $\phi$, which can be left implicit. As $\kappa$ we use the Gaussian kernel function given by

$$\kappa(x, y) = \exp\left( -\frac{\|x - y\|^2}{\sigma^2} \right), \tag{5}$$

where $x, y \in \mathbb{H}$. In turn, (5) depends on the norm function inherited from the functional Hilbert space where data are assumed to lie, and on the bandwidth $\sigma$. Regarding $\sigma$, we initially consider 9 different values, each one equal to a different percentile of the empirical distribution of $\{\|y_i - y_j\|, y_i, y_j \in Y_n\}$. The first percentile is 10%, and by increments of 10 we obtain the ninth percentile, i.e., 90%. Note that the lower $\sigma$, the more local the approach, and therefore the percentiles that we use cover different degrees of KFSD-based local approaches: strongly (e.g., 20%), moderately (e.g., 50%) and weakly (e.g., 80%) local approaches. In Section 4 we present a method to select $\sigma$ in outlier detection problems.
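The nine candidate bandwidths can be obtained from the pairwise distances between sample curves. The sketch below is one possible reading: it uses each unordered pair of distinct curves once and a linearly interpolated percentile, both of which are assumptions, since the text does not fix these details. Function names are hypothetical.

```python
import math
from itertools import combinations

def l2_dist(x, y, step=1.0):
    """Riemann-sum approximation of the L2 distance between two curves."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) * step)

def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def candidate_bandwidths(sample, step=1.0):
    """Map each percentile 10, 20, ..., 90 to the corresponding bandwidth."""
    dists = [l2_dist(a, b, step) for a, b in combinations(sample, 2)]
    return {q: percentile(dists, q) for q in range(10, 100, 10)}
```

Smaller percentiles give smaller bandwidths and hence a more local depth, which matches the ordering from strongly to weakly local approaches described above.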

In general, since any functional depth measures the degree of centrality or extremality of a given curve relative to a distribution or a sample, outliers are expected to have low depth values. In particular, in the presence of low magnitude, shape or partial outliers, an approach based on the use of a local depth like KFSD may help in detecting outliers. To illustrate this fact, we present the following example: first, we generated 100 data sets of size 50 from a mixture of two stochastic processes, one for normal curves and one for high magnitude outliers, with the probability that a curve is an outlier equal to 0.05. Second, we generated a group of 100 data sets from a mixture which produces shape outliers. Finally, we generated a group of 100 data sets from a mixture which produces partial outliers. In Figure 2 we report a contaminated data set for each mixture.



Figure 2: Examples of contaminated data sets: high magnitude contamination 
(top), shape contamination (middle) and partial contamination (bottom). The 
solid curves are normal curves and the dashed curves are outliers 


Let $n_{out,j}$, $j = 1, \ldots, 100$, be the number of outliers generated in the $j$th data set. For each data set and functional depth, it is desirable to assign the $n_{out,j}$ lowest depth values to the $n_{out,j}$ generated outliers. For each mixture and generated data set, we recorded how many times the depth of an outlier is among the $n_{out,j}$ lowest values. As depth functions, we considered five global depths (FMD, RTD, IDD, MBD and FSD) and two local depths (HMD and KFSD). The results reported in Table 1 show that for all the functional depths the ranking of high magnitude outliers is an easier task than the ranking of shape and partial outliers. However, while the ranking of high magnitude outliers is reasonably good in different cases, e.g., for the local KFSD (94.87%) and the global RTD (90.17%), the ranking of shape and partial outliers is markedly better with local depths (shape: 86.72% for KFSD and 85.94% for HMD; partial: 82.03% for KFSD and 81.25% for HMD) than with the best global depths (shape: FSD with 39.06%; partial: FSD with 46.48%). These results suggest that, with a properly selected threshold, KFSD can isolate outliers well.
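The ranking criterion used in this experiment can be sketched as follows. The code assumes that the percentage is pooled over all outliers of all simulated data sets, which is one plausible reading of the description; the function names and the input layout (a list of pairs of depth values and outlier indices) are hypothetical.

```python
def hits_among_lowest(depths, outlier_idx):
    """Count outliers whose depth is among the n_out lowest depth values."""
    n_out = len(outlier_idx)
    order = sorted(range(len(depths)), key=lambda i: depths[i])
    return len(set(order[:n_out]) & set(outlier_idx))

def percentage_well_ranked(data_sets):
    """data_sets: list of (depths, outlier_idx) pairs, one per simulated sample."""
    hits = sum(hits_among_lowest(d, o) for d, o in data_sets)
    total = sum(len(o) for _, o in data_sets)
    # With no outliers at all the percentage is undefined
    return 100.0 * hits / total if total else float("nan")
```

A depth that always places the true outliers at the bottom of the ranking scores 100 under this criterion, regardless of how many outliers each data set contains.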


Table 1: Percentages of times a depth assigns a value among the $n_{out,j}$ lowest ones to an outlier. Types of outliers: high magnitude, shape and partial.

                            global depths                          local depths
type of outliers            FMD     RTD     IDD     MBD     FSD    HMD     KFSD
high magnitude outliers     86.32   90.17   81.62   69.23   68.80  85.47   94.87
shape outliers               7.81   33.59   38.67   12.11   39.06  85.94   86.72
partial outliers            18.75   44.53   34.77   19.14   46.48  81.25   82.03


3 OUTLIER DETECTION FOR FUNCTIONAL DATA

The outlier detection problem can be described as follows: let $Y_n = \{y_1, \ldots, y_n\}$ be a sample generated from a mixture of two functional random variables in $\mathbb{H}$, one for normal curves and one for outliers, say $Y_{nor}$ and $Y_{out}$, respectively. Let $Y_{mix}$ be a mixture, i.e.,

$$Y_{mix} = \begin{cases} Y_{nor}, & \text{with probability } 1 - \alpha, \\ Y_{out}, & \text{with probability } \alpha, \end{cases} \tag{6}$$

where $\alpha \in [0,1]$ is the contamination probability (usually, a value rather close to 0). The curves composing $Y_n$ are all unlabeled, and the goal of the analysis is to decide whether each curve is a normal curve or an outlier.


KFSD is a functional extension of the kernelized spatial depth for multivariate data (KSD) proposed by Chen et al. (2009), who also proposed a KSD-based outlier detector that we generalize to KFSD: for a given data set $Y_n$ generated from $Y_{mix}$ and $t \in [0,1]$, the KFSD-based outlier detector for $x \in \mathbb{H}$ is given by

$$g(x, Y_n) = \begin{cases} 1, & \text{if } \mathrm{KFSD}(x, Y_n) \le t, \\ 0, & \text{if } \mathrm{KFSD}(x, Y_n) > t, \end{cases} \tag{7}$$

where $t$ is a threshold which discriminates between outliers (i.e., $g(x, Y_n) = 1$) and normal curves (i.e., $g(x, Y_n) = 0$), and it is a parameter that needs to be set.

For the multivariate case, KSD-based outlier detection is carried out under different scenarios. One of them consists of an outlier detection problem where two samples are available and the threshold $t$ is selected by controlling the probability that normal observations are classified as outliers, i.e., the false alarm probability (FAP). The selection criterion is based on a result providing a KSD-based probabilistic upper bound on the FAP which depends on $t$. Then, the threshold for KSD is given by the maximum value of $t$ such that the upper bound does not exceed a given desired FAP. We extend this result to KFSD:

Theorem 1 Let $Y_{n_Y} = \{y_1, \ldots, y_{n_Y}\}$ and $Z_{n_Z} = \{z_1, \ldots, z_{n_Z}\}$ be i.i.d. samples generated from the unknown mixture of random variables $Y_{mix} \in \mathbb{H}$ described by (6), with $\alpha > 0$. Let $g(\cdot, Y_{n_Y})$ be the outlier detector defined in (7). Fix $\delta \in (0, 1)$ and suppose that $\alpha \le r$ for some $r \in [0, 1]$. For a new random element $x$ generated from $Y_{nor}$, the following inequality holds with probability at least $1 - \delta$:



where $E_{x|Y_{n_Y}}$ refers to the expected value with respect to $x$ for a given $Y_{n_Y}$.
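A bound of the kind described in Theorem 1 can be written out explicitly. The display below is a sketch of the form that a Hoeffding-type concentration argument would give under the assumptions of the theorem; its exact constants are an assumption and should be checked against the original statement of inequality (8).

```latex
E_{x \mid Y_{n_Y}}\left[ g(x, Y_{n_Y}) \right]
\le \frac{1}{1 - r}
\left[ \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y})
+ \sqrt{\frac{\ln(1/\delta)}{2 n_Z}} \right]
```

Under this form, the first term inside the brackets is the fraction of curves in $Z_{n_Z}$ flagged by the detector, the second term is a confidence correction that vanishes as $n_Z$ grows, and the factor $1/(1 - r)$ accounts for the fact that $Z_{n_Z}$ is drawn from the contaminated mixture rather than from $Y_{nor}$ alone.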

The proof of Theorem 1 is presented in the Appendix. Recall that the FAP is the probability that a normal observation $x$ is classified as an outlier. For the elements of Theorem 1, $\Pr_{x|Y_{n_Y}}(g(x, Y_{n_Y}) = 1)$ is the FAP. Moreover,

$$\Pr\nolimits_{x|Y_{n_Y}}(g(x, Y_{n_Y}) = 1) = E_{x|Y_{n_Y}}\left[ g(x, Y_{n_Y}) \right].$$

Therefore, the probabilistic upper bound of Theorem 1 applies also to the FAP.

It is worth noting that the application of Theorem 1 requires observing two samples, a circumstance rather uncommon in classical outlier detection problems, in which usually a single sample generated from an unknown mixture of random variables is available. For this reason, we propose a solution which makes it possible to use Theorem 1 in the presence of a single sample. Note that the general idea behind it holds also in the multivariate framework, and therefore it would enable KSD-based outlier detection when only one $\mathbb{R}^d$-sample is available.

In the functional context, our solution consists in setting $Y_{n_Y} = Y_n$ and in obtaining $Z_{n_Z}$ by resampling with replacement from $Y_n$. Note that by doing this, and for sufficiently large values of $n_Z$, the effect of $\delta$ on the probabilistic upper bound is also drastically reduced. Concerning $r$, that is, the upper bound for the unknown contamination probability $\alpha$, a range between 0 and 0.1 appears to be appropriate to cover most of the situations found in practice. Regarding the resampling procedure to obtain $Z_{n_Z}$, we consider three different schemes, all of them with replacement. Since we deal with potentially contaminated data sets, besides simple resampling, we also consider two robust KFSD-based resampling procedures inspired by the work of Febrero et al. (2008). The three resampling schemes that we consider are:


1. Simple resampling.

2. KFSD-based trimmed resampling: once $\mathrm{KFSD}(y_i, Y_n)$, $i = 1, \ldots, n$, are obtained, it is possible to identify the $100\alpha_T\%$ least deep curves, for a certain $0 < \alpha_T < 1$ usually close to 0, that we advise to set equal to $r$. These least deep curves are deleted from the sample, and simple resampling is carried out with the remaining curves.

3. KFSD-based weighted resampling: once $\mathrm{KFSD}(y_i, Y_n)$, $i = 1, \ldots, n$, are obtained, weighted resampling is carried out with weights $w_i = \mathrm{KFSD}(y_i, Y_n)$.
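The three schemes above differ only in how the indices of the resampled curves are drawn, which the following sketch makes explicit. It assumes that the depth values have already been computed, and it trims the ceil(alpha_t * n) least deep curves, a rounding rule chosen here for illustration. Function names are hypothetical.

```python
import math
import random

def simple_indices(n, n_z, rng):
    """Simple resampling with replacement: every curve is equally likely."""
    return [rng.randrange(n) for _ in range(n_z)]

def trimmed_indices(depths, n_z, alpha_t, rng):
    """Drop the least deep curves, then resample uniformly from the rest."""
    n = len(depths)
    n_trim = math.ceil(alpha_t * n)  # assumed rounding rule
    order = sorted(range(n), key=lambda i: depths[i])
    kept = order[n_trim:]
    return [rng.choice(kept) for _ in range(n_z)]

def weighted_indices(depths, n_z, rng):
    """Resample with probabilities proportional to the depth values."""
    return rng.choices(range(len(depths)), weights=depths, k=n_z)
```

For example, `trimmed_indices(depths, 300, 0.05, random.Random(1))` draws 300 indices from a sample of 50 curves after discarding the 3 least deep ones. Both robust schemes reduce the chance that outlying curves enter the resampled set: trimming excludes them outright, weighting only makes them less likely.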

All the above procedures generate samples with some repeated curves. However, in a preliminary stage of our study we observed that it is preferable to work with $Z_{n_Z}$ composed of non-repeated curves. To obtain such samples, we add a common smoothing step to the previous three resampling schemes.

To describe the smoothing step, first recall that each curve in $Y_n$ is in practice observed at a discretized and finite set of domain points, and that the sets may differ from one curve to another. For this reason, the estimation of $Y_n$ at a common set of $m$ equidistant domain points may be required. Let $(y_i(s_1), \ldots, y_i(s_m))$ be the observed or estimated $m$-dimensional equidistant discretized version of $y_i$, $\Sigma_{Y_n}$ be the covariance matrix of the discretized form of $Y_n$ and $\gamma$ be a smoothing parameter. Consider a zero-mean Gaussian process whose discretized form has $\gamma \Sigma_{Y_n}$ as covariance matrix. Let $(\zeta(s_1), \ldots, \zeta(s_m))$ be a discretized realization of the previous Gaussian process. Consider any of the previous three resampling procedures and assume that at the $j$th trial, $j = 1, \ldots, n_Z$, the $i$th curve in $Y_n$ has been sampled. Then, the discretized form of the $j$th curve in $Z_{n_Z}$ would be given by $(z_j(s_1), \ldots, z_j(s_m)) = (y_i(s_1) + \zeta(s_1), \ldots, y_i(s_m) + \zeta(s_m))$, or, in functional form, by $z_j = y_i + \zeta$. Therefore, combining each resampling scheme with this smoothing step, we provide three different approximate ways to obtain $Z_{n_Z}$, and we refer to them as smo, tri and wei, respectively.

Then, for fixed $\delta$, $r$ and desired FAP, the threshold $t$ for (7) is selected as the maximum value of $t$ such that the right-hand side of (8) does not exceed the desired FAP. Let $t^*$ be the selected threshold, which is then used in (7) to compute $g(y_i, Y_n)$, $i = 1, \ldots, n$. If $g(y_i, Y_n) = 1$, $y_i$ is detected as an outlier. To summarize, we provide three KFSD-based outlier detection procedures and we refer to them as $\mathrm{KFSD}_{smo}$, $\mathrm{KFSD}_{tri}$ and $\mathrm{KFSD}_{wei}$ depending on how $Z_{n_Z}$ is obtained (smo, tri and wei, respectively; recall that $Y_{n_Y} = Y_n$). As competitors of the proposed procedures, we consider the methods mentioned in Section 1, which we now describe.
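The smoothing step and the threshold selection can be sketched as follows. Two assumptions are made and labeled in the code: the upper bound is taken to have a Hoeffding-type form, (flagged fraction + sqrt(ln(1/delta) / (2 n_Z))) / (1 - r), and the threshold is searched only among the depth values of the resampled curves. The noise curve is drawn as a random combination of the centered sample curves, which has covariance gamma times the sample covariance (with divisor n - 1) without requiring a matrix factorization. All names are hypothetical.

```python
import math
import random

def smoothing_noise(sample, gamma, rng):
    """Draw one discretized curve from N(0, gamma * Sigma), Sigma = sample covariance."""
    n, m = len(sample), len(sample[0])
    mean = [sum(y[k] for y in sample) / n for k in range(m)]
    weights = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = math.sqrt(gamma / (n - 1))
    # A standard-normal combination of centered curves has covariance gamma * Sigma
    return [scale * sum(w * (y[k] - mean[k]) for w, y in zip(weights, sample))
            for k in range(m)]

def smoothed_curve(sample, index, gamma, rng):
    """z_j = y_i + zeta for the resampled index i."""
    noise = smoothing_noise(sample, gamma, rng)
    return [a + b for a, b in zip(sample[index], noise)]

def fap_bound(z_depths, t, r, delta):
    """Assumed Hoeffding-type upper bound on the FAP for threshold t."""
    n_z = len(z_depths)
    flagged = sum(1 for d in z_depths if d <= t) / n_z
    return (flagged + math.sqrt(math.log(1.0 / delta) / (2.0 * n_z))) / (1.0 - r)

def select_threshold(z_depths, r, delta, fap):
    """Largest resampled depth value whose bound does not exceed the desired FAP."""
    admissible = [t for t in sorted(z_depths)
                  if fap_bound(z_depths, t, r, delta) <= fap]
    return admissible[-1] if admissible else None
```

Here `z_depths` stands for the KFSD values of the curves in $Z_{n_Z}$ computed relative to $Y_n$. When `select_threshold` returns `None`, no threshold satisfies the bound for the given $n_Z$, which is one reason to prefer a large resampled set.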


Sun and Genton (2011) proposed a depth-based functional boxplot and an associated outlier detection rule based on the ranking of the sample curves that MBD provides. The ranking is used to define a sample central region, that is, the smallest band containing at least half of the deepest curves. The non-outlying region is defined by inflating the central region by 1.5 times. Curves that do not belong completely to the non-outlying region are detected as outliers. The original functional boxplot is based on the use of MBD as depth, but clearly any functional depth can be used. Another contribution of this paper is the study of the performances of the outlier detection rule associated with the functional boxplot (from now on, FBP) when used together with the battery of functional depths mentioned in Section 2.
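The functional boxplot rule can be sketched for discretized curves as follows. The code assumes that depth values are supplied by the caller, that the central region is the pointwise envelope of the ceil(n / 2) deepest curves, and that inflation extends the envelope by 1.5 times its pointwise range on each side; the function name is hypothetical.

```python
import math

def functional_boxplot_outliers(sample, depths, factor=1.5):
    """Indices of curves leaving the inflated central region at any grid point."""
    n, m = len(sample), len(sample[0])
    n_central = math.ceil(n / 2)
    deepest = sorted(range(n), key=lambda i: depths[i], reverse=True)[:n_central]
    # Pointwise envelope of the deepest half of the sample
    lower = [min(sample[i][k] for i in deepest) for k in range(m)]
    upper = [max(sample[i][k] for i in deepest) for k in range(m)]
    low_fence = [lo - factor * (up - lo) for lo, up in zip(lower, upper)]
    up_fence = [up + factor * (up - lo) for lo, up in zip(lower, upper)]
    return [i for i in range(n)
            if any(sample[i][k] < low_fence[k] or sample[i][k] > up_fence[k]
                   for k in range(m))]
```

Because the depth values are an input, the same function can be combined with any of the depths considered in this paper.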


Febrero et al. (2008) proposed two depth-based outlier detection procedures that select a threshold for FMD, HMD or IDD by means of two alternative robust smoothed bootstrap procedures whose single bootstrap samples are obtained using the above described tri and wei, respectively. At each bootstrap sample, the 1% percentile of the empirical distribution of the depth values is obtained, say $p_{0.01}$. If $B$ is the number of bootstrap samples, $B$ values of $p_{0.01}$ are obtained. Each method selects as cutoff $c$ the median of the collection of $p_{0.01}$ and, using $c$ as threshold, a first outlier detection is performed. If some curves are detected as outliers, they are deleted from the sample, and the procedure is repeated until no more outliers are found (note that $c$ is computed only in the first iteration). We refer to these methods as $B_{tri}$ and $B_{wei}$, and also in this case we evaluate these procedures using all the functional depths mentioned in Section 2.


Finally, we also consider two procedures proposed by Hyndman and Shang (2010) that are not based on the use of a functional depth. Both are based on the first two robust functional principal component scores and on two different graphical representations of them. The first proposal is the outlier detection rule associated with the functional bagplot (from now on, FBG), which works as follows: obtain the bivariate robust scores and order them using the multivariate halfspace depth (Tukey 1975). Define an inner region by considering the smallest region containing at least 50% of the deepest scores, and obtain a non-outlying region by inflating the inner region by 2.58 times. FBG detects as outliers those curves whose scores are outside the non-outlying region. Note that the scores-based regions and outliers make it possible to draw a bivariate bagplot, which produces a functional bagplot once it is mapped onto the original functional space. The second proposal is related to a different graphical tool, the high density region boxplot (from now on, we refer to its associated outlier detection rule as FHD). In this case, once the scores are obtained, perform a bivariate kernel density estimation. Define the $(1 - \beta)$-high density region (HDR), $\beta \in (0, 1)$, as the region of scores with coverage probability equal to $(1 - \beta)$. FHD detects as outliers those curves whose scores are outside the $(1 - \beta)$-HDR. In this case, it is possible to draw a bivariate HDR boxplot which can be mapped onto a functional version, thus providing the functional HDR boxplot.


4 SIMULATION STUDY 

After introducing $\mathrm{KFSD}_{smo}$, $\mathrm{KFSD}_{tri}$ and $\mathrm{KFSD}_{wei}$, their competitors (FBP, $B_{tri}$, $B_{wei}$, FBG and FHD), as well as seven different functional depths (FMD, HMD, RTD, IDD, MBD, FSD and KFSD), in this section we carry out a simulation study to evaluate the performances of the different methods. For FBP, $B_{tri}$ and $B_{wei}$, we use the notation procedure+depth: for example, FBP+FMD refers to the method obtained by using FBP together with FMD.

To perform our simulation study, we consider six models: all of them generate curves according to the mixture of random variables $Y_{mix}$ described by (6). The first three mixture models (MM1, MM2 and MM3) share $Y_{nor}$, with curves generated by

$$y(s) = 4s + \epsilon(s), \tag{9}$$

where $s \in [0,1]$ and $\epsilon(s)$ is a zero-mean Gaussian component with covariance function given by

$$E(\epsilon(s), \epsilon(s')) = 0.25 \exp\left(-(s - s')^2\right), \quad s, s' \in [0,1].$$

The remaining three mixture models (MM4, MM5 and MM6) also share $Y_{nor}$, but, in this case, the curves are generated by

$$y(s) = u_1 \sin s + u_2 \cos s, \tag{10}$$

where $s \in [0, 2\pi]$ and $u_1$ and $u_2$ are observations from a continuous uniform random variable between 0.05 and 0.15.

MM1, MM2 and MM3 differ in their $Y_{out}$ components. Under MM1, the outliers are generated by

$$y(s) = 8s - 2 + \epsilon(s),$$

which produces outliers of both shape and low magnitude nature. Under MM2, the outliers are generated by adding to (9) an observation from a $N(0,1)$, and as a result outliers are more irregular than normal curves. Finally, under MM3, the outliers are generated by

$$y(s) = 4\exp(s) + \epsilon(s),$$

which produces curves that are normal in the first part of the domain, but that become exponentially outlying.

Similarly, MM4, MM5 and MM6 differ in their $Y_{out}$ components. Under MM4, the outliers are generated by replacing $u_2$ with $u_3$ in (10), where $u_3$ is an observation from a continuous
uniform random variable between 0.15 and 0.17. This change produces partial low magnitude outliers in the first and middle part of the domain of the curves. Under MM5, the outliers are generated by adding to (10) an observation from a $N(0, (\tfrac{1}{2})^2)$, and they turn out to be more irregular curves. Finally, under MM6, the outliers are generated by



(11)

where $u_4$ is an observation from a continuous uniform random variable between 0.1 and 0.15. Like MM3, MM6 allows outliers that are normal in the first part of the domain and become outlying with an exponential pattern. In Figure 3 we report a simulated data set with at least one outlier for each mixture model.
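As an example of how such a contaminated data set can be simulated, the sketch below generates one MM4 sample. It assumes that (10) reads y(s) = u_1 sin s + u_2 cos s, that each curve is an outlier independently with probability alpha, and that outliers draw the cosine coefficient from the uniform distribution on (0.15, 0.17); the function name and the seed handling are hypothetical.

```python
import math
import random

def mm4_sample(n=50, alpha=0.05, m=51, seed=0):
    """Simulate n curves from MM4 on m equidistant points of [0, 2*pi]."""
    rng = random.Random(seed)
    grid = [2.0 * math.pi * k / (m - 1) for k in range(m)]
    curves, is_outlier = [], []
    for _ in range(n):
        outlier = rng.random() < alpha  # mixture indicator as in (6)
        u1 = rng.uniform(0.05, 0.15)
        # Outliers replace u_2 by u_3, drawn from a narrower and higher range
        u_cos = rng.uniform(0.15, 0.17) if outlier else rng.uniform(0.05, 0.15)
        curves.append([u1 * math.sin(s) + u_cos * math.cos(s) for s in grid])
        is_outlier.append(outlier)
    return grid, curves, is_outlier
```

Since sin(0) = 0 and cos(0) = 1, the value of each curve at the first grid point equals its cosine coefficient, which shows why MM4 outliers deviate only slightly and only in parts of the domain.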

The details of the simulation study are the following: for each mixture model, we generated 100 data sets, each one composed of 50 curves. As mentioned above, Theorem 1 cannot be directly applied to a single sample, and therefore $\mathrm{KFSD}_{smo}$, $\mathrm{KFSD}_{tri}$ and $\mathrm{KFSD}_{wei}$ represent practical alternatives. Two values of the contamination probability $\alpha$ were considered: 0.02 and 0.05. All curves were generated using a discretized and finite set of 51 equidistant points in the domain of each mixture model ($[0,1]$ for MM1, MM2 and MM3; $[0, 2\pi]$ for MM4, MM5 and MM6) and the discretized versions of the functional depths were used.

Figure 3: Examples of contaminated functional data sets generated by MM1, MM2, MM3, MM4, MM5 and MM6. Solid curves are normal curves and dashed curves are outliers.

In relation to the methods and the functional depths that we consider in the study, their specifications are described next:


1. FBP when used with FMD, HMD, RTD, IDD, MBD, FSD and KFSD: regarding FBP, as reported in Section 3, the central region is built considering the 50% deepest curves and the non-outlying region by inflating the central region by 1.5 times. Regarding the depths, for HMD, we follow the recommendations in Febrero et al. (2008), that is, $\mathbb{H}$ is the $L^2$ space, $\kappa(x, y) = \frac{2}{\sqrt{2\pi}} \exp\left( -\frac{\|x - y\|^2}{2h^2} \right)$ and $h$ is equal to the 15% percentile of the empirical distribution of $\{\|y_i - y_j\|, y_i, y_j \in Y_n\}$. For RTD and IDD, we work with 50 projections in random Gaussian directions. For MBD, we consider bands defined by two curves. For FSD and KFSD, we assume that the curves lie in the $L^2$ space. Moreover, in KFSD we set $\sigma$ equal to a moderately local percentile (50%) of the empirical distribution of $\{\|y_i - y_j\|, y_i, y_j \in Y_n\}$.

2. $B_{tri}$ and $B_{wei}$ when used with FMD, HMD, RTD, IDD, MBD, FSD and KFSD: $\gamma = 0.05$, $B = 200$, $\alpha_T = \alpha$. Regarding the depths, we use the specifications reported for FBP.

3. FBG: as reported in Section 3, the central region is built considering the 50% deepest bivariate robust functional principal component scores and the non-outlying region by inflating the central region by 2.58 times.

4. FHD: $\beta = \alpha$.

5. $\mathrm{KFSD}_{smo}$, $\mathrm{KFSD}_{tri}$ and $\mathrm{KFSD}_{wei}$: $n_Y = n = 50$ (since $Y_{n_Y} = Y_n$), $\gamma = 0.05$, $\alpha_T = \alpha$ (only for $\mathrm{KFSD}_{tri}$), $n_Z = 6n$, $\delta = 0.05$, $r = \alpha$, desired FAP $= 0.10$. Moreover, as introduced in Section 2, for these methods we consider 9 percentiles to set $\sigma$ in KFSD. The way in which we propose to choose the most suitable percentile for outlier detection is presented below.

In supervised classification, the availability of training curves with known class memberships makes possible the definition of some natural procedures to set $\sigma$ for KFSD, such as cross-validation. However, in an outlier detection problem, it is common to have no information on whether curves are normal or outliers. Therefore, training procedures are not immediately available.

We propose to overcome this drawback by obtaining a "training sample of peripheral curves", and then choosing the percentile that best ranks these peripheral curves as the final percentile for KFSD in $\mathrm{KFSD}_{smo}$, $\mathrm{KFSD}_{tri}$ and $\mathrm{KFSD}_{wei}$. We now describe this procedure, which is based on $J$ replications. Let $Y_n$ be the functional data set on which outlier detection has to be done and let $Y_{(n)} = \{y_{(1)}, \ldots, y_{(n)}\}$ be the depth-based ordered version of $Y_n$, where $y_{(1)}$ and $y_{(n)}$ are the curves with minimum and maximum depth, respectively. The steps to obtain a set of peripheral curves are the following:

I. Let $\{p_1, \ldots, p_K\}$ be the set of percentiles in use (in our case, as explained in Section 2, $p_k = (10k)\%$, $k \in \{1, \ldots, K = 9\}$), and choose randomly a percentile from the set. For the $j$th replication, $j \in \{1, \ldots, J\}$, denote the selected percentile as $p^j$. We use $J = 20$ in the rest of the paper.

II. Using $p^j$, compute $\mathrm{KFSD}_{p^j}(y_i, Y_n)$, $i = 1, \ldots, n$, where the notation $\mathrm{KFSD}_{p^j}(\cdot, \cdot)$ is used to describe what percentile is used. For the $j$th replication, denote the KFSD-based ordered curves as $y_{(1),j}, \ldots, y_{(n),j}$.

III. Take y ^ )}j ,..., where l j rs-/ Bin(n, -). Apply the smoothing step described in 

Section [3] to these curves. For the smoothing step, we use Y Yn and 7 = 0.05. For the 
jth replication, denote the peripheral and smoothed curves as y*^ ., y*j^ y 

IV. Repeat J times steps H1-HTT1 to obtain a collection of L — Y2j=i h peripheral curves, say 
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Yl (for an example, see Figure |4j. 



Figure 4: Example of a training sample of peripheral curves for a contaminated
data set generated by MM1 with α = 0.05. The solid and shaded curves are
the original curves (both normal and outliers). The dashed curves are the
peripheral curves to use as training sample.
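Steps I-IV can be sketched as follows. This is only an illustration under stated assumptions: the depth is supplied by the caller through a hypothetical callable `depth_by_percentile(Y, p)`; the smoothing step is assumed to add zero-mean Gaussian noise with covariance γ times the sample covariance of the curves; and the success probability of the binomial draw is left as a required argument `prob`, a hypothetical name.

```python
import numpy as np

def peripheral_training_sample(Y, depth_by_percentile, percentiles, prob,
                               J=20, gamma=0.05, rng=None):
    """Collect smoothed peripheral curves over J replications (steps I-IV).

    Returns (curves, origin): the smoothed peripheral curves and, for each,
    the row index in Y of the curve it was generated from.
    """
    rng = np.random.default_rng() if rng is None else rng
    Y = np.asarray(Y, dtype=float)
    n, m = Y.shape
    # ASSUMPTION: smoothing noise is N(0, gamma * sample covariance of Y).
    cov = gamma * np.atleast_2d(np.cov(Y, rowvar=False))
    curves, origin = [], []
    for _ in range(J):
        p = rng.choice(percentiles)                 # step I: random percentile
        depths = np.asarray(depth_by_percentile(Y, p))
        order = np.argsort(depths)                  # step II: least deep first
        l_j = rng.binomial(n, prob)                 # step III: how many to keep
        for idx in order[:l_j]:
            noise = rng.multivariate_normal(np.zeros(m), cov)
            curves.append(Y[idx] + noise)
            origin.append(idx)
    # step IV: the collection Y_L, with L = sum of the l_j
    return np.array(curves).reshape(-1, m), np.array(origin, dtype=int)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Y = rng.normal(size=(50, 10))
    # Toy stand-in for KFSD: curves far from the mean get low "depth".
    toy_depth = lambda Y, p: -np.linalg.norm(Y - Y.mean(axis=0), axis=1)
    Y_L, origin = peripheral_training_sample(Y, toy_depth, range(10, 100, 10),
                                             prob=0.05, rng=rng)
    print(Y_L.shape, origin[:5])
```

Keeping track of `origin` matters for the next stage, where each peripheral curve is evaluated against the sample without the curve it came from.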

Next, Y_L acts as training sample according to the following steps: for each y*_{(i),j} ∈ Y_L (i ≤ l_j)
and p_k ∈ {p_1, ..., p_K}, compute KFSD_{p_k}(y*_{(i),j}, Y_{-(i),j}), where Y_{-(i),j} = Y_n \ {y_{(i),j}}. At
the end, an L × K matrix is obtained, say D_LK = {d_lk}, l = 1, ..., L, k = 1, ..., K, whose kth
column is composed of the KFSD values of the L training peripheral curves when the kth
percentile is employed in KFSD. Next, let r_lk be the rank of d_lk in the vector
(KFSD_{p_k}(y_1, Y_n), ..., KFSD_{p_k}(y_n, Y_n), d_lk), e.g., r_lk is equal to 1 or n + 1 if d_lk is the
minimum or the maximum value in the vector, respectively. Let R_LK be the result of this
transformation of D_LK, and sum the elements of each column, obtaining a K-dimensional
vector, say R_K. Since the goal is to assign ranks as low as possible to the peripheral curves,
choose the percentile associated with the minimum value of R_K. When a tie is observed, we
break it randomly.
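The rank-based choice of the percentile can be sketched as below. The sketch takes the depth values as precomputed inputs, so it does not depend on any particular implementation of KFSD; the function name `select_percentile` and the handling of exact equality between a training depth and a sample depth (the training curve gets the lower rank) are assumptions.

```python
import numpy as np

def select_percentile(sample_depths, peripheral_depths, rng=None):
    """Pick the column (percentile) that ranks the peripheral curves lowest.

    sample_depths:     (n, K) array, entry [i, k] = KFSD_{p_k}(y_i, Y_n)
    peripheral_depths: (L, K) array, the matrix D_LK of the text
    Returns (k_best, column_sums), with k_best a 0-based column index.
    """
    rng = np.random.default_rng() if rng is None else rng
    S = np.asarray(sample_depths, dtype=float)
    D = np.asarray(peripheral_depths, dtype=float)
    # r_lk = 1 + number of sample depths strictly below d_lk (so 1 <= r_lk <= n + 1)
    ranks = 1 + (S[None, :, :] < D[:, None, :]).sum(axis=1)   # matrix R_LK, shape (L, K)
    col_sums = ranks.sum(axis=0)                              # vector R_K
    best = np.flatnonzero(col_sums == col_sums.min())
    return int(rng.choice(best)), col_sums                    # ties broken at random

if __name__ == "__main__":
    S = np.array([[0.5, 0.5], [0.6, 0.6], [0.7, 0.7]])
    D = np.array([[0.1, 0.9], [0.55, 0.65]])
    k, sums = select_percentile(S, D)
    print(k, sums)
```

In the toy call, the first column ranks both training curves near the bottom of the sample, so it is the one selected.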

The comparison among methods is performed in terms of both correct and false outlier
detection percentages, which are reported in Tables 2-7. To ease the reading of the tables,
for each model and α, we report in bold the 5 best methods in terms of correct outlier
detection percentage (c)¹. For each model, if a method is among the 5 best ones for both
contamination probabilities α, we report its label in bold.

Table 2: MM1, α = {0.02, 0.05}. Correct (c) and false (f) outlier detection percentages
of FBP, B_tri, B_wei, FBG, FHD, KFSD_smo, KFSD_tri and KFSD_wei.

                α = 0.02        α = 0.05
Method             c       f       c       f
FBP+FMD        44.34    1.23   43.86    0.73
FBP+HMD        74.53    0.94   72.81    0.61
FBP+RTD        61.32    0.57   63.16    0.31
FBP+IDD        55.66    0.61   61.84    0.34
FBP+MBD        49.06    1.33   50.44    0.69
FBP+FSD        62.26    0.67   61.84    0.40
FBP+KFSD       66.04    0.86   74.12    0.44
B_tri+FMD       0.00    0.98    0.00    1.82
B_tri+HMD      66.98    1.45   57.89    1.47
B_tri+RTD      10.38    1.78   14.91    1.76
B_tri+IDD      10.38    1.55   11.84    1.74
B_tri+MBD       0.00    0.51    0.00    1.49
B_tri+FSD       2.83    0.76    5.26    1.17
B_tri+KFSD     70.75    1.43   58.77    1.40
B_wei+FMD       0.00    1.29    0.00    1.49
B_wei+HMD      71.70    1.02   47.37    0.65
B_wei+RTD      13.21    2.04   13.60    1.78
B_wei+IDD      17.92    1.82   10.53    1.55
B_wei+MBD       0.00    1.08    0.00    1.40
B_wei+FSD       2.83    1.39    3.95    1.07
B_wei+KFSD     61.32    0.88   55.26    0.48
FBG           100.00    2.27   97.81    2.37
FHD            48.11    1.00   73.68    2.77
KFSD_smo       89.62    4.50   85.09    2.58
KFSD_tri       89.62    4.92   92.11    4.40
KFSD_wei       97.17    9.44   96.93    6.54

Table 3: MM2, α = {0.02, 0.05}. Correct (c) and false (f) outlier detection percentages
of FBP, B_tri, B_wei, FBG, FHD, KFSD_smo, KFSD_tri and KFSD_wei.

                α = 0.02        α = 0.05
Method             c       f       c       f
FBP+FMD        99.09    1.08   96.39    0.84
FBP+HMD        96.36    0.96   96.39    0.88
FBP+RTD        99.09    0.61   94.78    0.25
FBP+IDD        99.09    0.70   95.18    0.38
FBP+MBD        99.09    1.06   96.39    0.82
FBP+FSD        99.09    0.57   94.78    0.36
FBP+KFSD       98.18    0.63   93.98    0.36
B_tri+FMD       0.00    1.06    0.00    1.96
B_tri+HMD      95.45    1.51   96.79    1.68
B_tri+RTD       1.82    1.92    6.83    2.61
B_tri+IDD       5.45    1.60    7.63    1.94
B_tri+MBD       0.00    0.98    0.40    2.10
B_tri+FSD       4.55    1.06    5.22    1.62
B_tri+KFSD     97.27    1.60   95.18    1.52
B_wei+FMD       0.00    1.27    0.00    1.52
B_wei+HMD      95.45    1.02   86.35    0.36
B_wei+RTD       5.45    2.21    8.43    2.84
B_wei+IDD       7.27    1.49    9.64    2.36
B_wei+MBD       0.00    1.27    0.40    1.49
B_wei+FSD       8.18    1.39    4.02    1.37
B_wei+KFSD     95.45    0.96   79.52    0.51
FBG             8.18    3.07    4.42    2.95
FHD             7.27    1.88   12.45    5.66
KFSD_smo      100.00    3.91   95.18    2.76
KFSD_tri      100.00    5.19   97.99    4.84
KFSD_wei      100.00    9.20   99.60    6.48

The results in Tables 2-7 show that:

1. KFSD_tri and KFSD_wei are always among the 5 best methods. KFSD_smo is among the 5
best methods 10 times out of 12, and when it is not, its performance is not far from
that of the fifth method (MM2, α = 0.05: 95.18% against 96.79%;
MM3, α = 0.05: 73.79% against 78.63%). The remaining methods are among the 5
best procedures at most 4 times out of 12 (FBP+HMD and B_tri+HMD).

2. Regarding MM5 and MM6, our procedures are clearly the best options in terms of
correct detection (c), in the following order: KFSD_wei, KFSD_tri and KFSD_smo.
In general, this pattern is observed throughout the simulation study. Note that for MM6
and α = 0.02 we observe the best relative performances of KFSD_smo, KFSD_tri and
KFSD_wei, i.e., 91.58%, 93.68% and 96.84%, respectively, against 71.58% of the fourth

¹ In the presence of a tie, the method with the lower false outlier detection percentage (f) is preferred.

Table 4: MM3, α = {0.02, 0.05}. Correct (c) and false (f) outlier detection percentages
of FBP, B_tri, B_wei, FBG, FHD, KFSD_smo, KFSD_tri and KFSD_wei.

                α = 0.02        α = 0.05
Method             c       f       c       f
FBP+FMD        65.69    0.92   49.19    0.97
FBP+HMD        89.22    0.57   85.89    0.63
FBP+RTD        86.27    0.45   76.61    0.34
FBP+IDD        79.41    0.51   70.56    0.38
FBP+MBD        74.51    0.88   59.27    0.84
FBP+FSD        79.41    0.51   73.79    0.42
FBP+KFSD       89.22    0.57   83.06    0.59
B_tri+FMD       2.94    0.73    5.24    1.22
B_tri+HMD      57.84    1.57   53.63    1.56
B_tri+RTD      15.69    1.76   21.37    1.81
B_tri+IDD      20.59    1.65   20.56    1.70
B_tri+MBD       0.98    1.06    3.23    1.54
B_tri+FSD      16.67    1.14   17.34    1.22
B_tri+KFSD     57.84    1.63   49.19    1.52
B_wei+FMD       2.94    1.10    3.63    0.84
B_wei+HMD      60.78    1.25   42.74    0.76
B_wei+RTD      15.69    1.92   17.34    1.73
B_wei+IDD      23.53    1.33   14.52    1.22
B_wei+MBD       0.98    1.29    2.82    1.14
B_wei+FSD      15.69    1.16   12.10    0.84
B_wei+KFSD     56.86    1.12   41.53    0.67
FBG            86.27    2.65   78.63    1.73
FHD            49.02    1.02   65.73    2.88
KFSD_smo       89.22    3.90   73.79    2.95
KFSD_tri       90.20    4.63   83.47    4.71
KFSD_wei       97.06    8.96   90.32    6.50

Table 5: MM4, α = {0.02, 0.05}. Correct (c) and false (f) outlier detection percentages
of FBP, B_tri, B_wei, FBG, FHD, KFSD_smo, KFSD_tri and KFSD_wei.

                α = 0.02        α = 0.05
Method             c       f       c       f
FBP+FMD         1.02    0.00    0.00    0.00
FBP+HMD         6.12    0.00    1.60    0.02
FBP+RTD         0.00    0.00    0.00    0.00
FBP+IDD         0.00    0.00    0.00    0.00
FBP+MBD         0.00    0.00    0.00    0.00
FBP+FSD         0.00    0.00    0.00    0.00
FBP+KFSD        2.04    0.00    0.80    0.00
B_tri+FMD      60.20    0.16   47.60    0.11
B_tri+HMD      41.84    0.04   18.80    0.17
B_tri+RTD      54.08    1.16   34.80    0.82
B_tri+IDD      55.10    1.02   37.20    0.59
B_tri+MBD      64.29    0.14   46.40    0.13
B_tri+FSD      68.37    0.14   45.60    0.08
B_tri+KFSD     58.16    0.20   28.00    0.13
B_wei+FMD      51.02    0.12   23.60    0.00
B_wei+HMD      38.78    0.06   10.80    0.02
B_wei+RTD      37.76    0.49   25.20    0.15
B_wei+IDD      43.88    0.67   28.00    0.42
B_wei+MBD      56.12    0.10   25.20    0.02
B_wei+FSD      63.27    0.06   29.20    0.00
B_wei+KFSD     58.16    0.12   21.20    0.00
FBG             9.18    0.53    6.80    1.09
FHD            51.02    1.02   37.60    4.34
KFSD_smo       87.76    2.16   50.00    1.24
KFSD_tri       91.84    3.00   64.80    2.91
KFSD_wei       95.92    5.08   62.00    3.35

Table 6: MM5, α = {0.02, 0.05}. Correct (c) and false (f) outlier detection percentages
of FBP, B_tri, B_wei, FBG, FHD, KFSD_smo, KFSD_tri and KFSD_wei.

                α = 0.02        α = 0.05
Method             c       f       c       f
FBP+FMD        55.56    0.00   54.00    0.00
FBP+HMD        66.67    0.00   68.40    0.04
FBP+RTD        57.58    0.00   54.40    0.00
FBP+IDD        52.53    0.00   56.00    0.00
FBP+MBD        55.56    0.00   55.20    0.00
FBP+FSD        55.56    0.00   55.60    0.00
FBP+KFSD       60.61    0.00   59.20    0.00
B_tri+FMD       3.03    0.18    2.80    0.44
B_tri+HMD      97.98    0.12   92.40    0.11
B_tri+RTD      16.16    1.06   20.00    1.03
B_tri+IDD      18.18    1.06   16.00    1.07
B_tri+MBD       2.02    0.16    3.20    0.32
B_tri+FSD      29.29    0.18   27.20    0.23
B_tri+KFSD     93.94    0.24   92.40    0.21
B_wei+FMD       3.03    0.29    2.40    0.23
B_wei+HMD      93.94    0.08   73.60    0.00
B_wei+RTD      15.15    1.06   17.60    1.12
B_wei+IDD      25.25    0.98   20.00    0.99
B_wei+MBD       2.02    0.20    3.60    0.21
B_wei+FSD      29.29    0.14   21.60    0.13
B_wei+KFSD     83.84    0.08   72.00    0.04
FBG             0.00    1.02    0.40    0.04
FHD             4.04    1.96   12.80    5.64
KFSD_smo       98.99    1.82   94.00    0.44
KFSD_tri       98.99    2.61   98.00    2.11
KFSD_wei      100.00    4.61   98.40    2.11

Table 7: MM6, α = {0.02, 0.05}. Correct (c) and false (f) outlier detection percentages
of FBP, B_tri, B_wei, FBG, FHD, KFSD_smo, KFSD_tri and KFSD_wei.

                α = 0.02        α = 0.05
Method             c       f       c       f
FBP+FMD        48.42    0.00   44.19    0.00
FBP+HMD        60.00    0.18   62.92    0.00
FBP+RTD        55.79    0.00   54.68    0.00
FBP+IDD        46.32    0.00   40.07    0.00
FBP+MBD        48.42    0.00   45.69    0.00
FBP+FSD        52.63    0.00   52.43    0.00
FBP+KFSD       57.89    0.00   56.93    0.00
B_tri+FMD      29.47    0.22   33.71    0.32
B_tri+HMD      71.58    0.24   45.69    0.15
B_tri+RTD      35.79    0.82   31.09    0.51
B_tri+IDD      38.95    0.37   35.96    0.74
B_tri+MBD      29.47    0.24   31.09    0.32
B_tri+FSD      52.63    0.20   43.82    0.19
B_tri+KFSD     71.58    0.22   50.56    0.21
B_wei+FMD      23.16    0.24   19.48    0.08
B_wei+HMD      68.42    0.12   35.96    0.00
B_wei+RTD      38.95    0.69   24.34    0.51
B_wei+IDD      33.68    0.59   25.09    0.40
B_wei+MBD      24.21    0.18   19.85    0.13
B_wei+FSD      47.37    0.16   27.72    0.08
B_wei+KFSD     66.32    0.12   44.19    0.06
FBG            17.89    0.02   14.98    0.06
FHD            52.63    1.02   61.80    2.85
KFSD_smo       91.58    2.08   71.16    0.95
KFSD_tri       93.68    2.69   82.02    2.49
KFSD_wei       96.84    4.69   83.15    2.75

best method (B_tri+KFSD), that is, we observe differences of at least 20 percentage points.

3. About MM3, KFSD_wei is clearly the best method in terms of correct detection, although
at the price of a greater false detection percentage (f). This is in general the main weak
point of KFSD_smo, KFSD_tri and KFSD_wei. As for correct detection, we observe an
overall pattern in the false detection of our methods, but in the opposite direction, which
indicates a trade-off between c and f. Relatively high false detection percentages are
however to be expected for KFSD_smo, KFSD_tri and KFSD_wei, since these methods
are based on the definition of a desired false alarm probability, which is equal to 10% in
this study. Concerning MM2, we observe similar results to MM3, but in this case the
performances of the best methods in terms of correct detection (KFSD_smo, KFSD_tri,
KFSD_wei, FBP-based methods and B_tri when used with local depths) are closer to each
other.

Finally, there are only 2 cases in which a competitor outperforms all our methods,
namely FBG under MM1 and both values of α. However, this procedure does not show a
behavior as stable as that of KFSD_smo, KFSD_tri and KFSD_wei. Indeed, FBG shows poor
performances under other models, e.g., MM2.

In summary, the above results and remarks show that the proposed KFSD-based
procedures are the best methods in detecting outliers for the considered models. Moreover,
KFSD_tri seems the most reasonable choice to balance the mentioned trade-off between c
and f. In terms of correct detection, KFSD_wei slightly outperforms KFSD_tri, which however
shows very good and stable performances when compared with the remaining methods. In
terms of false detection, KFSD_tri considerably improves on KFSD_wei, especially under some
models (e.g., see MM2).

In Figure 5 we report a series of boxplots summarizing which percentiles have been
selected in the training steps for KFSD_smo, KFSD_tri and KFSD_wei, and the following general
remarks can be made. First, MM6 is the mixture model for which lower percentiles have
been selected, and it is also a scenario in which our methods considerably outperform their
competitors. The need for a more local approach for MM6 data may explain these two
facts. Second, lower and more local percentiles have been chosen
for mixture models with nonlinear mean functions (MM4, MM5 and MM6) than for mixture
models with linear mean functions (MM1, MM2 and MM3). Finally, the percentiles selected
by means of the proposed training procedure seem to vary among data sets. However, except
for MM3 and α = 0.02, for at least half of the data sets a percentile not greater than the
median has been chosen, which implies at most a moderately local approach.

Figure 5: Boxplots of the percentiles selected in the training steps of the
simulation study for KFSD_smo, KFSD_tri and KFSD_wei.


5 REAL DATA STUDY: NITROGEN OXIDES (NO_x) DATA

Besides simulated data, we consider a real data set which consists of nitrogen oxides (NO_x)
emission level daily curves measured every hour close to an industrial area in Poblenou
(Barcelona) and is available in the R package fda.usc (Febrero and Oviedo de la Fuente
2012). Outlier detection on this data set was first performed by Febrero et al. (2008), where
these authors proposed B_tri and B_wei. We extend their study by considering more methods
and depths.

NO_x are among the most important pollutants, and it is important to identify outlying
trajectories because these curves may compromise any statistical analysis, or be of special
interest for further analysis and for implementing environmental political countermeasures. The
NO_x levels that we consider were measured in µg/m³ every hour of every day for the period
23/02/2005-26/06/2005. The 24 hourly measurements are available for only 115 days of the
period, and these are the days that compose the final NO_x data set. Moreover, following
Febrero et al. (2008), since the NO_x data set includes working as well as nonworking days,
it seems more appropriate to consider a first sample of 76 working day curves (from now on, W)
and a second sample of 39 nonworking day curves (from now on, NW). Both W and NW are
shown in Figure 6, where it is possible to appreciate at least two facts that justify the split
of the original data set. First, the W curves have in general higher values than the NW curves,
which can be explained by the greater activity of motor vehicles and industries in a city
like Barcelona during working days. Second, both data sets contain curves with peaks, but
for W curves the peaks occur roughly around 7-8 a.m. and on many days, whereas for
NW curves the peaks occur later and on few days, which again can be explained by the
differences between Barcelona's economic activity on working and nonworking days.


Figure 6: NO_x data: working (top) and nonworking (bottom) day curves.

At first glance, each data set may contain outliers, especially partial outliers in the form
of abnormal peaks, and therefore a local depth approach by means of KFSD_smo, KFSD_tri
and KFSD_wei appears to be a good strategy to detect outliers. Besides them, we perform outlier
detection with all the methods used in Section 4. For all the procedures we use the same
specifications as in Section 4 and we assume α = 0.05. For each method, we report the
labels of the curves detected as outliers in Table 8 and we highlight these curves in Figure 7.


Table 8: NO_x data, working and nonworking data sets. Curves detected as outliers by FBP,
B_tri, B_wei, FBG, FHD, KFSD_smo, KFSD_tri and KFSD_wei.

            Detected outliers
Method      Working days                    Nonworking days
FBP+FMD     -                               -
FBP+HMD     12, 16, 37                      5, 7, 20, 21
FBP+RTD     37                              20
FBP+IDD     -                               5, 7, 20
FBP+MBD     -                               -
FBP+FSD     37                              -
FBP+KFSD    12, 16, 37                      5, 7, 20, 21
B_tri+FMD   16, 37                          7
B_tri+HMD   14, 16, 37                      7, 20
B_tri+RTD   16                              7, 20
B_tri+IDD   16, 37                          7, 20
B_tri+MBD   16, 37                          7
B_tri+FSD   14, 16, 37                      -
B_tri+KFSD  12, 14, 16, 37                  7, 20
B_wei+FMD   16                              7, 20
B_wei+HMD   16, 37                          7, 20
B_wei+RTD   16                              -
B_wei+IDD   16, 37                          20
B_wei+MBD   16                              7
B_wei+FSD   16, 37                          -
B_wei+KFSD  16, 37                          7, 20
FBG         16, 37                          -
FHD         12, 14, 16, 37                  7, 20
KFSD_smo    14, 16, 37                      7, 20, 21
KFSD_tri    12, 14, 16, 37                  7, 20, 21
KFSD_wei    11, 12, 13, 14, 15, 16, 37, 38  7, 20, 21

Figure 7: NO_x data set, curves detected as outliers in Table 8: working (top)
and nonworking (bottom) days.

Concerning W, most of the methods detect as an outlier day 37, the Friday at the beginning
of the long weekend due to Labor Day in 2005, whose curve shows a partial outlying
behavior before noon and at the end of the day. Another day detected as an outlier by many
methods is day 16, another Friday before a long weekend, the Easter holidays in 2005, whose
curve has the highest morning peak. In addition to curves 16 and 37, KFSD_smo detects as an
outlier curve 14, as other nine methods do, recognizing a seemingly outlying pattern in the early
hours of the day. Additionally, KFSD_tri also includes day 12 among the outliers, which
may be atypical because of its behavior in the early afternoon. Note that both day 12 and day 14
are in the week before the above-mentioned Easter holidays. Finally, KFSD_wei detects as
outliers the greatest number of curves. This last result may appear exaggerated, but all the
curves that are outliers according to KFSD_wei seem to have some partial deviations from
the majority of curves. For example, day 13, whose curve is considered normal by the rest
of the procedures, shows a peak at the end of the day. Similar peaks can also be observed in
other curves detected as outliers by other methods (e.g., days 16 and 37), which means that
a masking effect may be occurring to the detriment of day 13, and only KFSD_wei points out
this possibly outlying feature of the curve. Regarding the training step for KFSD to set σ,
it gives as result the 70% percentile. Observing the first graph of Figure El it can be noticed 
that some curves have a likely outlying behavior, and this may be the reason why a weakly 
local approach for KFSD may be adequate enough. 

In the case of NW, some methods detect no curves as outliers (e.g., all the FSD-based
methods), only three FBP-based methods flag day 5 as an outlier, whereas days 7, 20
and 21 are detected as outliers by our methods as well as by others. Note that day 7 is the
Saturday before Easter, and days 20 and 21 are the eve of Labor Day and Labor Day itself.
Days 7 and 20, which have two peaks, at the beginning and at the end of the day, are also flagged
by other twelve and eight methods, respectively, while day 21, which shows a single peak in
the first hours of the day, is considered atypical by only two other methods, which happen to
be local (FBP+HMD and FBP+KFSD). This last result may be connected with what has
been observed at the KFSD training step for selecting the percentile, i.e., the selection of
the 30% percentile. Therefore, KFSD_smo, KFSD_tri and KFSD_wei work with a strongly local
percentile, and their results partially resemble the ones of the previously mentioned local
techniques.


6 CONCLUSIONS 

This paper proposes to tackle outlier detection in functional samples using the kernelized
functional spatial depth as a tool. In Theorem 1 we presented a probabilistic result that allows
setting a KFSD threshold to identify outliers, but in practice it is necessary to observe two
samples to apply Theorem 1. To overcome this practical limitation, we proposed KFSD_smo,
KFSD_tri and KFSD_wei, which are methods that can be applied when a single functional
sample is available and are based on both a probabilistic approach and smoothed resampling
techniques.

We also proposed a new procedure to set the bandwidth σ of KFSD that is based on
obtaining training samples by means of smoothed resampling techniques. The general idea
behind this procedure can be applied to other functional depths or methods with parameters
that need to be set.

We investigated the performances of KFSD_smo, KFSD_tri and KFSD_wei by means of a
simulation study. We focused on challenging scenarios with low magnitude, shape and partial
outliers instead of high magnitude outliers. The results support our proposals. Throughout the
simulation study, KFSD_smo, KFSD_tri and KFSD_wei attained the highest correct detection
percentages in most of the analyzed setups, but in some cases they paid a price in terms of
false detection. However, KFSD_smo, KFSD_tri and KFSD_wei work with a given desired false
alarm probability, and therefore higher false detection percentages than their competitors
are due to the inherent structure of the methods. We also observed a trade-off between c
and f for KFSD_smo, KFSD_tri and KFSD_wei, and a clear pattern. For these reasons, in our
opinion KFSD_tri should be preferred to KFSD_smo or KFSD_wei, since it performs extremely well
in terms of correct detection, while it has lower false detection percentages than KFSD_wei.
Concerning the remaining methods, there are competitors that in a few scenarios outperformed
our methods. However, in these few cases the differences are not great, and in addition these
competitors are not stable across the considered scenarios.

Furthermore, we also showed that our procedures can be applied in environmental
contexts with an example where the goal was to detect outlying NO_x curves to identify days
possibly characterized by abnormal pollution levels.

To conclude, we present two possible future research lines. First, since KFSD is a depth
whose local approach is in part based on the choice of the kernel function, it would be
interesting to explore how the choice of different kernels affects the behavior of KFSD.
Moreover, each kernel will depend on a bandwidth and a norm. For the selection of the
bandwidth, we used a criterion based on the study of the empirical distribution of the
sample distances, but alternatives should be investigated, for example an adaptation of the
so-called Silverman's rule (Silverman 1986) for selecting the bandwidth of a kernel-based
functional depth such as KFSD. For the choice of the norm, a sensitivity study would help
in understanding how important the functional space assumption is. Second, since outlier
detection can be seen as a special case of cluster analysis (it is a cluster problem with
at most two clusters, one of them with size much smaller than the other, even 0), a
natural next step in our research may be the definition of KFSD-based cluster analysis
procedures.

A Appendix


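A.1 From FSD(x, Y_n) to KFSD(x, Y_n)

To show how to pass from FSD(x, Y_n) in (1) to KFSD(x, Y_n) in (4), we first show that
FSD(x, Y_n) can be expressed in terms of inner products. We present this result for n = 2.
The squared norm in (1) can be written as

\[
\left\| \sum_{i=1}^{2} \frac{x - y_i}{\|x - y_i\|} \right\|^2
= \left\| \frac{x - y_1}{\|x - y_1\|} + \frac{x - y_2}{\|x - y_2\|} \right\|^2
= \left\| \frac{x - y_1}{\sqrt{\langle x, x\rangle + \langle y_1, y_1\rangle - 2\langle x, y_1\rangle}}
+ \frac{x - y_2}{\sqrt{\langle x, x\rangle + \langle y_2, y_2\rangle - 2\langle x, y_2\rangle}} \right\|^2 .
\]

Let \delta_1 = \sqrt{\langle x, x\rangle + \langle y_1, y_1\rangle - 2\langle x, y_1\rangle} and
\delta_2 = \sqrt{\langle x, x\rangle + \langle y_2, y_2\rangle - 2\langle x, y_2\rangle}. Then,

\[
\begin{aligned}
\left\| \sum_{i=1}^{2} \frac{x - y_i}{\|x - y_i\|} \right\|^2
&= \left\| \frac{x - y_1}{\delta_1} + \frac{x - y_2}{\delta_2} \right\|^2 \\
&= \left\langle \frac{x - y_1}{\delta_1}, \frac{x - y_1}{\delta_1} \right\rangle
 + \left\langle \frac{x - y_2}{\delta_2}, \frac{x - y_2}{\delta_2} \right\rangle
 + \frac{2}{\delta_1 \delta_2} \langle x - y_1, x - y_2 \rangle \\
&= 2 + \frac{2}{\delta_1 \delta_2} \left( \langle x, x\rangle + \langle y_1, y_2\rangle - \langle x, y_1\rangle - \langle x, y_2\rangle \right) \\
&= \sum_{i,j=1}^{2} \frac{\langle x, x\rangle + \langle y_i, y_j\rangle - \langle x, y_i\rangle - \langle x, y_j\rangle}{\delta_i \delta_j},
\end{aligned}
\]

and apply the embedding map \phi to all the observations of the last expression. According to
(2), this is equivalent to substituting the inner product function with a positive definite and
stationary kernel function \kappa, which explains the definition of KFSD(x, Y_n) in (4) for n = 2.
The generalization of this result to n > 2 is straightforward.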
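The inner-product form above translates directly into a computation on discretized curves. The following is a minimal sketch, not the authors' implementation: it assumes a Gaussian kernel κ(x, y) = exp(-||x - y||² / σ²), a Euclidean norm on the observation grid in place of the functional norm, and the usual spatial-depth convention of dropping the terms for sample curves identical to x; the function name `kfsd` is hypothetical.

```python
import numpy as np

def kfsd(x, Y, sigma):
    """Sample kernelized functional spatial depth of curve x w.r.t. the rows of Y.

    ASSUMPTIONS: Gaussian kernel exp(-||a - b||^2 / sigma^2); curves are vectors
    on a common grid; terms with zero kernel distance to x are skipped.
    """
    x = np.asarray(x, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    k_xx = 1.0                                             # kappa(x, x) for this kernel
    k_xy = np.exp(-((Y - x) ** 2).sum(axis=1) / sigma**2)  # kappa(x, y_i), shape (n,)
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    k_yy = np.exp(-sq / sigma**2)                          # kappa(y_i, y_j), shape (n, n)
    # delta_i = sqrt(kappa(x,x) + kappa(y_i,y_i) - 2 kappa(x,y_i)), with kappa(y_i,y_i) = 1
    delta = np.sqrt(np.maximum(k_xx + 1.0 - 2.0 * k_xy, 0.0))
    keep = delta > 0
    num = k_xx + k_yy - k_xy[:, None] - k_xy[None, :]      # numerators of the double sum
    num = num[np.ix_(keep, keep)]
    d = delta[keep]
    total = (num / np.outer(d, d)).sum()                   # squared norm of the sum of unit vectors
    return 1.0 - np.sqrt(max(total, 0.0)) / n

if __name__ == "__main__":
    Y = np.array([[0.0], [1.0], [2.0], [10.0]])
    print(kfsd(Y[1], Y, sigma=5.0), kfsd(Y[3], Y, sigma=5.0))
```

In the toy call, the curve lying among the others receives a higher depth than the isolated one, which is the ordering that the outlier detection procedures rely on.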


A.2 Proof of Theorem 1

As explained in Section 3, Theorem 1 is a functional extension of a result derived by
Chen et al. (2009) for KSD, and since they are closely related, next we report a sketch
of the proof of Theorem 1. The proof for KSD is mostly based on an inequality known as
McDiarmid's inequality (McDiarmid 1989), which also applies to general probability spaces,
and therefore to functional Hilbert spaces. We report this inequality in the next lemma:

Lemma 1 (McDiarmid 1989 [1-2]) Let \Omega_1, ..., \Omega_n be probability spaces. Let
\Omega = \prod_{j=1}^{n} \Omega_j and let X : \Omega \to \mathbb{R} be a random variable. For any
j \in \{1, ..., n\}, let (\omega_1, ..., \omega_j, ..., \omega_n) and (\omega_1, ..., \hat{\omega}_j, ..., \omega_n)
be two elements of \Omega that differ only in their jth coordinates.
Assume that X is uniformly difference-bounded by \{c_j\}, that is, for any j \in \{1, ..., n\},

\[
\left| X(\omega_1, \ldots, \omega_j, \ldots, \omega_n) - X(\omega_1, \ldots, \hat{\omega}_j, \ldots, \omega_n) \right| \le c_j . \tag{12}
\]

Then, if E[X] exists, for any \tau > 0

\[
\Pr\left( X - \mathrm{E}[X] \ge \tau \right) \le \exp\left( \frac{-2\tau^2}{\sum_{j=1}^{n} c_j^2} \right).
\]

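In order to apply Lemma 1 to our problem, define

\[
X(z_1, \ldots, z_{n_Z}) = -\frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y} \mid Y_{n_Y}), \tag{13}
\]

whose expected value is given by

\[
\mathrm{E}[X] = \mathrm{E}_{z_i \mid Y_{n_Y}}\left[ -\frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y} \mid Y_{n_Y}) \right]
= -\mathrm{E}_{z_1 \mid Y_{n_Y}}\left[ g(z_1, Y_{n_Y} \mid Y_{n_Y}) \right].
\]

Now, for any j \in \{1, ..., n_Z\} and \hat{z}_j \in \mathbb{H}, the following inequality holds

\[
\left| X(z_1, \ldots, z_j, \ldots, z_{n_Z}) - X(z_1, \ldots, \hat{z}_j, \ldots, z_{n_Z}) \right| \le \frac{1}{n_Z}, \tag{14}
\]

and it provides assumption (12) of Lemma 1. Therefore, for any \tau > 0

\[
\Pr\left( \mathrm{E}_{z_1 \mid Y_{n_Y}}\left[ g(z_1, Y_{n_Y} \mid Y_{n_Y}) \right] - \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y} \mid Y_{n_Y}) \ge \tau \right) \le \exp\left( -2 n_Z \tau^2 \right),
\]

and by the law of total probability

\[
\begin{aligned}
&\mathrm{E}\left[ \Pr\left( \mathrm{E}_{z_1 \mid Y_{n_Y}}\left[ g(z_1, Y_{n_Y} \mid Y_{n_Y}) \right] - \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y} \mid Y_{n_Y}) \ge \tau \right) \right] \\
&\quad = \Pr\left( \mathrm{E}_{z_1 \mid Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right] - \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y}) \ge \tau \right) \le \exp\left( -2 n_Z \tau^2 \right).
\end{aligned}
\]

Next, setting \delta = \exp(-2 n_Z \tau^2), and solving for \tau, the following result is obtained:

\[
\tau = \sqrt{\frac{\ln(1/\delta)}{2 n_Z}} .
\]

Therefore,

\[
\Pr\left( \mathrm{E}_{z_1 \mid Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right] \le \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y}) + \sqrt{\frac{\ln(1/\delta)}{2 n_Z}} \right) \ge 1 - \delta . \tag{15}
\]

However, Theorem 1 provides a probabilistic upper bound for \mathrm{E}_{x \mid Y_{n_Y}}[g(x, Y_{n_Y})]. First,
recall that z_1 \sim Y_{mix} and note that

\[
\mathrm{E}_{(z_1 \sim Y_{mix}) \mid Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right]
= (1 - \alpha)\, \mathrm{E}_{(z_1 \sim Y_{nor}) \mid Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right]
+ \alpha\, \mathrm{E}_{(z_1 \sim Y_{out}) \mid Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right].
\]

Then, since \mathrm{E}_{(z_1 \sim Y_{nor}) \mid Y_{n_Y}}[g(z_1, Y_{n_Y})] = \mathrm{E}_{x \mid Y_{n_Y}}[g(x, Y_{n_Y})], for \alpha > 0,

\[
\mathrm{E}_{x \mid Y_{n_Y}}\left[ g(x, Y_{n_Y}) \right] \le \frac{1}{1 - \alpha}\, \mathrm{E}_{(z_1 \sim Y_{mix}) \mid Y_{n_Y}}\left[ g(z_1, Y_{n_Y}) \right]. \tag{16}
\]

Consequently, combining (15) and (16), and for r \ge \alpha, we obtain

\[
\Pr\left( \mathrm{E}_{x \mid Y_{n_Y}}\left[ g(x, Y_{n_Y}) \right] \le \frac{1}{1 - r} \left[ \frac{1}{n_Z} \sum_{i=1}^{n_Z} g(z_i, Y_{n_Y}) + \sqrt{\frac{\ln(1/\delta)}{2 n_Z}} \right] \right) \ge 1 - \delta,
\]

which completes the proof.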
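To give a sense of scale, the deviation term obtained by setting δ = exp(-2 n_Z τ²) can be evaluated for the simulation settings (δ = 0.05 and n_Z = 6n with n = 50). The sketch below only evaluates that closed-form expression; the function name `deviation_margin` is hypothetical.

```python
import math

def deviation_margin(delta, n_z):
    """Return sqrt(ln(1/delta) / (2 * n_z)), the solution of delta = exp(-2 * n_z * tau^2)."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if n_z <= 0:
        raise ValueError("n_z must be positive")
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n_z))

if __name__ == "__main__":
    n = 50
    margin = deviation_margin(delta=0.05, n_z=6 * n)
    print(f"margin = {margin:.4f}")
    # Round trip: plugging the margin back in recovers delta.
    print(math.exp(-2 * 6 * n * margin**2))
```

The margin shrinks at rate 1/sqrt(n_Z), so a larger second sample tightens the bound.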
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